Machine Learning in Finance

Module 3

Matthew G. Son

University of South Florida

Density-based Clustering

Hierarchical Clustering

  • Starts with each point as its own cluster

  • Combines the two nearest clusters

  • Keeps combining until all points are merged into a single cluster

  • Does not require specifying the number of clusters
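The procedure above can be sketched with base R's `hclust()` (my illustrative data, not from the lecture):

```r
# Agglomerative hierarchical clustering with base R
set.seed(123)
pts <- matrix(rnorm(20), ncol = 2)  # 10 random points in 2D
hc  <- hclust(dist(pts))            # repeatedly merges the nearest clusters
plot(hc)                            # dendrogram of the merge history
```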

Hierarchical Clustering

Dendrogram

Choosing Number of Clusters

The length of the dendrogram's vertical lines represents the distance between the merged clusters

Rule of thumb: draw a horizontal line that crosses “large” distances
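In base R, the horizontal cut corresponds to `cutree()` (the cut height below is an illustrative assumption):

```r
# Cut the dendrogram where the vertical distances are "large"
set.seed(123)
hc <- hclust(dist(matrix(rnorm(20), ncol = 2)))
cutree(hc, h = 1.5)  # cut at height 1.5: returns a cluster label per point
cutree(hc, k = 3)    # or request a fixed number of clusters directly
```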

Examples

Density-based clustering works well for perceptual data:

Principal Component Analysis

PCA

A very popular algorithm in Machine Learning and Finance.

A statistical method to reduce the dimensionality of data

  • while retaining as much information (or variance) as possible

Main Idea

Dimensionality reduction is like taking a photo

  • Reduce dimensions: 3D -> 2D
  • Make sure everybody is visible (keep maximum information)
  • Getting the right “angle” is important

The right angle yields the best photo

Another example

Reducing dimensions can create (huge) distortions when not done properly

I rather want distortions!!

PCA

Formally, PCA is:

  • A linear transformation of the original \(p\) variables into a new set of \(q \leq p\) variables such that

    • The new variables (principal components) are uncorrelated
    • The first principal component has the maximum amount of variance
    • then the second, third, …
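These properties can be verified with base R's `prcomp()` on random data (my sketch, not the lecture's code):

```r
# Check the defining properties of PCA on simulated data
set.seed(123)
X  <- matrix(rnorm(300), ncol = 3)           # 100 obs, 3 variables
pc <- prcomp(X, center = TRUE, scale. = TRUE)
round(cor(pc$x), 2)  # PCs are uncorrelated: off-diagonals ~ 0
pc$sdev^2            # variances in descending order: PC1 first
```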

Use of PCA

The main use of PCA is dimensionality reduction

  • Powerful when a large number of features are redundant
    • i.e., a feature reduction algorithm
  • Usually combined with other algorithms
    • As a preprocessing step in ML
    • With K-means: bypass the curse of dimensionality
    • PCR: Principal Component Regression
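As an illustration of PCR, one can regress a response on the first few principal components instead of all original features (a minimal sketch with simulated data; the variable choices are my own):

```r
# Principal Component Regression: use 2 PCs in place of 5 raw features
set.seed(123)
X <- matrix(rnorm(500), ncol = 5)                     # 100 obs, 5 features
y <- drop(X %*% c(1, 0.5, 0, 0, 0)) + rnorm(100, sd = 0.3)
pcs <- prcomp(X, scale. = TRUE)$x[, 1:2]              # first two PCs
fit <- lm(y ~ pcs)                                    # regression on PCs, not X
summary(fit)$r.squared
```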

More Technically: PCA

An orthogonal projection of the data onto a lower-dimensional space (e.g., 2D -> 1D)

  • that minimizes distance from original points (red) to projection (green)

  • that retains maximum variance between projected data points

We can use a purple line (PC) instead of two lines (dim1 and dim2)

Projection

Note what happens when the red dots are:

  1. Spread out the most (maximum variance): best
  2. Closely packed (minimum variance): maximum distortion (worst)

Getting the right angle is the key

Why maximum variance?

Note that each dot is “distinct”:

To keep them distinguishable in the lower-dimensional space, they must be spread out as much as possible.

Conversely, if many dots are crammed together and overlap, we cannot distinguish them in the lower-dimensional space.

Even more Technical: Computation behind PCA

  1. Eigenvalue Decomposition (EVD)
  • Eigenvectors give the new directions

  • Eigenvalues are analogous to the lengths of the eigenvectors

    • The information content (or importance)

    • Each is the variance of the corresponding principal component

  • Generally faster

  2. Singular Value Decomposition (SVD)
  • Works on any \(m \times n\) matrix

  • Scales well and is numerically stable because it does not require computing the covariance matrix
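The SVD route can be sketched in base R; note that no covariance matrix is formed (my illustration):

```r
# PCA via SVD: Z = U D V^T, no covariance matrix needed
set.seed(123)
Z <- scale(matrix(rnorm(200), ncol = 2))  # standardized data (100 x 2)
s <- svd(Z)
PC <- Z %*% s$v                           # principal components
eigenvalues <- s$d^2 / (nrow(Z) - 1)      # same eigenvalues as EVD of the Gram matrix
```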

PCA Algorithm (EVD)

  1. Standardize the data with \(p\) variables: \(X_{n \times p} \rightarrow Z_{n \times p}\)
  2. Compute the Gram matrix (or covariance matrix): \(G_{p \times p} = \frac{1}{n-1} Z^T Z\).
  3. Perform Eigenvalue Decomposition on \(G\): \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]
  • \(V\): Matrix of eigenvectors (new directions, principal components).
  • \(\Lambda\): Diagonal matrix with eigenvalues (variances explained, in descending order).
  4. Compute the principal components by projecting \(Z\) onto \(V\):

\[PC_{n \times p} = Z V_{p \times p}\]

  5. Variance Explained
  • \(\Lambda\) contains the eigenvalues (variance explained by the \(i\)-th principal component).
  • Percentage explained by the \(i\)-th component: \[\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}\]
  6. Factor Loadings \[L_{p \times p} = V_{p \times p} \sqrt{\Lambda_{p \times p}}\]
  • \(L\): Matrix of factor loadings, where \(L_{ij}\) is the loading of variable \(i\) on PC \(j\).

Factor Loadings

Factor loadings are weighted eigenvectors. Needed for Interpretable Machine Learning.

\[L_{p \times p} = V_{p \times p} \sqrt{\Lambda_{p \times p}}\]

\(L_{ij}\) is the loading of variable \(i\) on PC \(j\).

Interpretation:

  • High absolute values in \(L\) mean a variable strongly influences that PC.
  • Positive/negative signs show the direction of the relationship.

Takeaways of PCA

If you have a dataset with 1,000 observations and 200 variables:

Q1. How many principal components (PCs) will I have?

  • It will still give you 200 PCs (if not specified further)

Q2. What does the PCA output look like?

  • Same dimensions with changed numbers, 1,000 observations and 200 variables.

Q3. How can it reduce the variables, then?

  • From the PC output, drop as many of the rightmost columns as you like.

Q4. How many PCs should I choose?

  • It depends, but using variance explained makes the most sense.

Q5. Which PC is the most important?

  • The first column. It’s arranged automatically.

Q6. What do you mean most important?

  • Most variance, or information content.

Q7. Is that importance same as Eigenvalue?

  • That is correct. Each column has its own eigenvalue.

Q8. How many Eigenvalues will it have?

  • The same as the number of original variables: 200.
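The answers above can be checked with `prcomp()` (base R; sizes as in the example, content random):

```r
# A 1,000-observation, 200-variable dataset run through PCA
set.seed(123)
X  <- matrix(rnorm(1000 * 200), nrow = 1000)
pc <- prcomp(X, scale. = TRUE)
dim(pc$x)                   # 1000 x 200: same shape, new numbers (Q1, Q2)
pc_reduced <- pc$x[, 1:10]  # keep the leftmost, most important columns (Q3, Q5)
length(pc$sdev)             # 200 eigenvalues via pc$sdev^2, one per PC (Q7, Q8)
```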

Algorithm Summary

Input:

A dataset with \(n\) observations and \(p\) features

\[ X_{n\times p} \]

Primary Output:

A dataset with \(n\) observations and \(q\) new features (\(q \leq p\))

\[ PC_{n \times q} \]

Extra output:

  1. Explained variance (eigenvalues) of each principal component

\[ \lambda_{1,2,3,...,q} \]

  2. Factor loadings for the \(q\) components

\[ L_{p \times q} \]

Factor loadings: how the new \(q\) principal components are constructed from the original variables.

Step by Step in Code

Steps

  1. Standardize the data with \(p\) variables: \(X_{n \times p} \rightarrow Z_{n \times p}\)
  2. Compute the Gram matrix (or covariance matrix): \(G_{p \times p} = \frac{1}{n-1} Z^T Z\).
  3. Perform Eigenvalue Decomposition on \(G\): \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]
  4. Compute the principal components by projecting \(Z\) onto \(V\):
  5. Variance Explained
  6. Factor Loadings

Notes

Usually ML packages handle all steps at once and show the output summary nicely.

The code here demonstrates the steps performed behind the scenes.

For simplicity, I’m intentionally making two variables that are highly correlated.

  • If two variables are (highly) correlated, the additional information from the second one is small

Sample data

Create Y as linearly correlated variable. This is to show how PCA captures the maximum variance when reducing dimensions.

# Generate sample data
library(tidyverse) # for tibble() and the mutate/across pipeline used below
set.seed(123) # For reproducibility
X <- rnorm(100)
Y <- 2 * X + rnorm(100, sd = 0.5) # Y is linearly dependent on X with some noise
data <- tibble(X = X, Y = Y)
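For reference, a single `prcomp()` call reproduces all of the manual steps that follow (base R; not part of the lecture code):

```r
# One-call PCA on the same simulated data
set.seed(123)
X <- rnorm(100)
Y <- 2 * X + rnorm(100, sd = 0.5)
pr <- prcomp(cbind(X, Y), center = TRUE, scale. = TRUE)
summary(pr)  # standard deviations and proportion of variance per PC
```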

Step 1. Standardize

Z-score standardization:

Z <- data |>
  mutate(across(everything(), \(x) (x - mean(x)) / sd(x))) |>
  as.matrix()

Step 2. Gram Matrix

\[G_{p \times p} = \frac{1}{n-1} Z^T Z\]

G <- 1 / (100 - 1) * t(Z) %*% Z

Step 3. Eigen Decomposition

Eigen Value Decomposition: \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]

eigen_result <- eigen(G)
# Eigenvalues
eigenvalues <- eigen_result$values


# Eigenvectors
V <- eigen_result$vectors

Lambda <- diag(eigenvalues)

Step 4. Generate Principal Components

\[PC_{n \times p} = ZV \]

PC <- Z %*% V
dim(PC)
[1] 100   2

Variance Explained

Eigenvalues represent the variance captured by each principal component. Calculate the proportion of variance explained:

variance_explained <- eigenvalues / sum(eigenvalues)
variance_explained # 98.2% explained by PC1
[1] 0.98295351 0.01704649

Factor Loadings

Factor loadings help us understand how the original variables contribute to the principal components. \[L = V \sqrt{\Lambda}\]

# X and Y contribute equally to the first component
L <- V %*% sqrt(Lambda)
L

  • X and Y contribute equally (in absolute value) to constructing PC1

Visual Summary

Out of 2 dimension data:

  • PC1 captures maximum variablity of the data

  • PC2 captures the remaining variability

  • The more correlated the variables, the more effective the dimensionality reduction

Visual Summary

Variance Explained by Original

Variance Explained by PCs

Lab Walkthrough

H2O

In this step, we perform Principal Component Analysis (PCA) using the H2O framework.

  • First, the dataset is converted to an H2O frame; the non-numeric columns (country, abbrev) are excluded from the model via the x argument.

  • The PCA is performed using the h2o.prcomp() function:

    • k = 4: specifies the number of principal components to compute.

    • transform = "STANDARDIZE": standardizes (center and scale) all variables before applying PCA.

Country risk data

  1. Real GDP growth (from IMF): the higher, the better

  2. Corruption Index (Transparency International): the higher, the better (no corruption)

  3. Peace Index (Institute for Economics and Peace): the lower, the better (very peaceful)

  4. Legal risk index (Property Rights Association): the higher, the better (favorable)

Browse data

# library(tidyverse)
country_risk <- read_csv("ml_data/Country Risk 2019 Data.csv") |>
  janitor::clean_names()
country_risk |>
  head()
# A tibble: 6 × 6
  country   abbrev corruption peace legal gdp_growth
  <chr>     <chr>       <dbl> <dbl> <dbl>      <dbl>
1 Albania   AL             35  1.82  4.55       2.98
2 Algeria   DZ             35  2.22  4.43       2.55
3 Argentina AR             45  1.99  5.09      -3.06
4 Armenia   AM             42  2.29  4.81       6   
5 Australia AU             77  1.42  8.36       1.71
6 Austria   AT             77  1.29  8.09       1.60

Variable Correlations

library(GGally)
ggpairs(
  country_risk |>
    select(-country, -abbrev)
)

Initiate H2O

Initiate H2O for ML

library(h2o)
h2o.init()

Build PCA model

# Convert data to H2O frame (non-numeric columns are excluded via x below)
country_risk_h2o <- as.h2o(
  country_risk
)

# Build PCA model
pca_h2o <- h2o.prcomp(
  training_frame = country_risk_h2o,
  x = c("corruption", "peace", "legal", "gdp_growth"),
  k = 4, # number of principal components (in this case, q = p)
  transform = "STANDARDIZE" # center & scale data
)

Calculate Principal Components

To generate the PCs, simply make predictions with the PCA model.

h2o.predict(pca_h2o, country_risk_h2o)

Variance Explained

The model summary provides details of variance explained.

pca_h2o
Model Details:
==============

H2ODimReductionModel: pca
Model ID:  PCA_model_R_1771945215033_165 
Importance of components: 
                            pc1      pc2      pc3      pc4
Standard deviation     1.600254 1.001183 0.614453 0.243450
Proportion of Variance 0.640203 0.250592 0.094388 0.014817
Cumulative Proportion  0.640203 0.890795 0.985183 1.000000


H2ODimReductionMetrics: pca

No model metrics available for PCA

Factor Loadings

To get the factor loadings for each PC, we need to pull the eigenvalues and eigenvectors.

# eigenvectors
V <- as.matrix(pca_h2o@model$eigenvectors)
# the PC standard deviations (row 1 of importance) are the square roots of the eigenvalues
sqrt_Lambda <- diag(as.numeric(pca_h2o@model$importance[1, ]))

L <- V %*% sqrt_Lambda

Homework Exercise

mtcars data

With mtcars data,

Perform PCA and report explained variances of each principal component.

  • Reduce dimensions into 2 using PCA.

  • Report Principal Component dataframe with 2 columns

  • How many PCs are needed to explain at least 95% of the variation?

Homework Reading

Homework Reading

John C. Hull “Machine Learning in Business”

  • Chapter 2